chore(dataobj): data object encoding and decoding #15676
Conversation
This commit introduces the encoding package with utilities for writing and reading a data object. This initial commit includes a single section called "streams". The streams section holds a list of streams for which logs are available in the data object file; it does not hold the logs themselves, only the stream labels along with an ID.

Encoding
--------

Encoding presents a hierarchical API to match the file structure (see the sketch after this description):

1. Callers open an encoder.
2. Callers open a streams section from the encoder.
3. Callers open a column from the streams section.
4. Callers append a page into the column.

Child elements of the hierarchy have a Commit method to flush their written data and metadata to their parent.

Each element of the hierarchy exposes its current MetadataSize. Callers should use MetadataSize to control the size of an element. For example, if Encoder.MetadataSize goes past a limit, callers should stop appending new sections to the file and flush the file to disk.

To support discarding data after reaching a size limit, each child element of the hierarchy also has a Discard method.

Decoding
--------

Decoding separates each section into a different Decoder interface to more cleanly separate the APIs. The initial Decoder is for ReadSeekers, but later implementations will include object storage and caching.

The Decoder interfaces are designed for batch reading, so that callers can retrieve multiple columns or pages at once. Implementations can then use this to reduce the number of roundtrips (such as retrieving multiple cache keys in a single cache request).

encoding.StreamsDataset converts an instance of a StreamDecoder into a dataset.Dataset, allowing callers to use the existing dataset utility functions without downloading an entire dataset.
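As a rough illustration of the encoding flow above, here is a sketch of how a caller might drive the hierarchy. The identifiers (NewEncoder, OpenStreams, OpenColumn, AppendPage, maxMetadataSize, flushToDisk) are paraphrased from this description and are not the package's exact signatures:

```go
// Sketch only: names mirror the description, not the exact package API.
enc := encoding.NewEncoder(&buf)

streams, err := enc.OpenStreams() // open a streams section from the encoder
if err != nil {
	return err
}
col, err := streams.OpenColumn(columnInfo) // open a column from the section
if err != nil {
	return err
}
if err := col.AppendPage(page); err != nil { // append a page into the column
	return err
}

// Commit flushes a child's data and metadata up to its parent;
// Discard drops the child instead, e.g. after overshooting a size limit.
if err := col.Commit(); err != nil {
	return err
}
if err := streams.Commit(); err != nil {
	return err
}

// MetadataSize bounds the growth of an element: once the encoder's
// metadata goes past a limit, stop appending sections and flush.
if enc.MetadataSize() > maxMetadataSize {
	flushToDisk(&buf)
}
```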
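And a corresponding sketch of the batch-oriented decoding side. ReadSeekerDecoder, StreamsDecoder, ReadPages, and the descriptor values are illustrative stand-ins based on the description, not confirmed names:

```go
// Sketch only: illustrative names for the batch-oriented Decoder API.
dec := encoding.ReadSeekerDecoder(rs) // rs is an io.ReadSeeker

// Batch methods accept several descriptors at once so implementations
// (object storage, caches) can coalesce roundtrips, e.g. fetching
// multiple cache keys in a single cache request.
pages, err := dec.StreamsDecoder().ReadPages(ctx, columnDescs)
if err != nil {
	return err
}
_ = pages

// StreamsDataset wraps the decoder as a dataset.Dataset so the existing
// dataset utilities work without downloading the entire data object.
ds := encoding.StreamsDataset(dec.StreamsDecoder())
_ = ds
```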
Force-pushed from eb7c9cc to 2c48dc1
```go
var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}
```
@cyriltovena I initially set proto.Buffer.SetDeterministic here to have deterministic encoding of protobufs, but I think there's a bug in gogo protobuf that prevents it from working.

Either way, I think our encoding is already deterministic as long as we never include map types in our protobufs. I'll have some tests for that once we have the final pieces that tie everything together.
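For context, a minimal sketch of what enabling deterministic marshaling on the pooled buffers would look like with gogo's proto.Buffer, assuming the upstream issue were fixed:

```go
var protoBufferPool = sync.Pool{
	New: func() any {
		buf := new(proto.Buffer)
		// Would request deterministic output (e.g. sorted map keys),
		// but a suspected gogo/protobuf bug keeps this disabled for now.
		buf.SetDeterministic(true)
		return buf
	},
}
```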
LGTM
This PR introduces metadata for a "logs section," which is intended to hold a sequence of log records across one or more streams. The code is a near-identical copy of grafana#15676. Future work is needed to determine whether the encoding, decoding, and dataset implementations of dataset sections can be deduplicated.